High performance machine learning inference framework for edge devices

ABSTRACT

Techniques for high-performance machine learning (ML) inference in heterogeneous edge devices are described. A ML model trained using any of a variety of different frameworks is translated into a common format that is runnable by inference engines of edge devices. The translated model is optimized in hardware-agnostic and/or hardware-specific ways to improve inference performance, and the optimized model is sent to the edge devices. The inference engine for any edge device can be accessed by a customer application using a same defined API, regardless of the hardware characteristics of the edge device or the original format of the ML model.

BACKGROUND

With recent advancements in machine learning, a natural next step is to deploy models on “edge” devices in various environments, such as “smart” cameras, mobile devices such as smart phones, smart speakers, motor vehicles, etc. This configuration has the potential to allow inferences to be generated more quickly (e.g., on a same device that obtains the data upon which the inference is generated, instead of remotely—such as in a cloud network or other centralized location) and to enable faster reactions to these inferences.

However, the hardware available to generate inferences (e.g., processing units such as central processing units (CPUs), graphics processing units (GPUs), tensor processing units (TPUs), field programmable gate arrays (FPGAs), etc., the amounts and types of available memory, etc.) and the architectures of these hardware resources (e.g., instruction set architectures (ISAs) such as x86, ARM, MIPS, SPARC) vary significantly from one edge device to another. A consequence of these variations in hardware and architecture is that optimal inference speeds can only be achieved with vendor-specific software. This creates an undesirable coupling between hardware and software, which leads to applications being extremely difficult to port from one device type to another device type.

BRIEF DESCRIPTION OF DRAWINGS

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:

FIG. 1 is a diagram illustrating an exemplary environment including a unified inference framework server module according to some embodiments.

FIG. 2 is a diagram illustrating an exemplary architecture for high-performance machine learning inference according to some embodiments.

FIG. 3 is a diagram illustrating an exemplary environment with complete or partial edge optimization according to some embodiments.

FIG. 4 is a diagram illustrating an exemplary environment with complete or partial provider network optimization according to some embodiments.

FIG. 5 is a flow diagram illustrating operations of a method for implementing high-performance machine learning inference in edge devices according to some embodiments.

FIG. 6 is a block diagram of an illustrative operating environment in which machine learning models are trained and hosted according to some embodiments.

FIG. 7 illustrates an example provider network environment according to some embodiments.

FIG. 8 is a block diagram of an example provider network that provides a storage service and a hardware virtualization service to customers according to some embodiments.

FIG. 9 is a block diagram illustrating an example computer system that may be used in some embodiments.

FIG. 10 illustrates an example of an environment for implementing aspects in accordance with various embodiments.

DETAILED DESCRIPTION

Various embodiments of methods, apparatus, systems, and non-transitory computer-readable storage media for high-performance machine learning inference for heterogeneous edge devices are described. According to some embodiments, a unified inference framework for heterogeneous edge devices is provided that can accept machine learning (ML) models from a user in any of multiple formats (as generated by multiple different frameworks), convert and optimize these ML models for use by heterogeneous “edge” devices having heterogeneous computing resources, and deploy these ML models for use in one or more edge devices of one or more different types.

In some embodiments, the ML models can be optimized—possibly in a hardware-specific or hardware-class-specific way—at the memory and/or scheduling level to achieve high computing performance on different types of hardware provided by different edge devices. In some embodiments, the ML models can simply be deployed to different edge devices, resulting in the ML models being extremely portable in where they can be run. For example, in some embodiments the framework can work with both a graphics processing unit (GPU) and a central processing unit (CPU), across different hardware architectures such as an x86 CPU, ARM CPU, and/or GPU—or even on devices utilizing field programmable gate arrays (FPGAs) or microcontrollers. As a result, users seeking to run a ML model on a particular device or group of devices can be insulated from the hardware-specific challenges of such a deployment—instead, upon requesting a deployment, everything will “just work” from the user's standpoint. Further, some embodiments provide a single, unified set of user application programming interfaces (APIs) for users to perform model optimization and/or inference, which can keep a user's application code simple, as a single set of APIs enables use with potentially multiple different types of devices.

As indicated herein, the field of machine learning (and the specific category of deep learning) is developing extremely quickly. Many people and organizations are looking to machine learning to improve system availability through predictive maintenance, invent entirely new experiences on behalf of their customers, lower costs through automation, etc. In some cases, Internet of Things (IoT) devices—also commonly referred to as “edge” devices—are poised to play a central role in driving these improvements as running machine learning becomes more efficient and edge hardware capabilities continue to accelerate. However, successfully implementing machine learning at the edge in a sustainable and manageable way remains elusive. First, due to their size, machine learning models are cumbersome to manage and deploy reliably. As a result, such models are rarely deployed to edge devices, lessening the likelihood of continually improving capabilities through the re-training of models. Second, many original equipment manufacturers (OEMs) and partners invest significant resources in developing hardware-specific optimizations to achieve adequate performance, and then have to hand-tune models for their specific environment. This can take many months and requires extremely deep knowledge of both hardware and machine learning. Moreover, data collected by edge devices often ends up going to waste since improved models are slow to engineer, and risky to deploy and manage at scale, meaning that once deployed, edge-based IoT strategies calcify and become brittle over time. Finally, as it is important for ML models running inference to be extremely efficient (e.g., to execute quickly due to a large amount of data requiring inference), deploying models to edge devices becomes extremely difficult when the edge devices have heterogeneous hardware resources—e.g., the existence or non-existence of CPU cores, GPUs, FPGAs, etc., differing architectures (x86, ARM, etc.), different resource amounts and availabilities (e.g., amounts of random access memory (RAM)), etc. Thus, it is difficult to deploy efficient, optimized models for running inference to different devices having different resources.

Embodiments disclosed herein allow users to implement machine learning at the edge in a manageable and sustainable way. Further, some embodiments can provide a flywheel for ML on connected devices, allowing users to deploy models across potentially millions of devices and continually improve them. Data can be collected in a secure manner from devices and used to train new models, which can be re-deployed to devices. This in turn generates new data for additional cycles of re-training and re-deployment, where each cycle can increase availability of the system and improve device experiences.

Embodiments can utilize a provider network-based automated verification of model integrity, and/or model deployments and versions can be tracked in a provider network, giving a real-time view into deployment status across a fleet of edge devices.

Once ML models are deployed to the device(s), inferencing can be executed using a fast machine learning inference engine, which may leverage efficient thread-pooling and/or reinforcement learning to automatically optimize models for any hardware platform. Embodiments can also allow developers to quickly and continually improve their models. By automatically collecting low-confidence predictions from devices and syncing them with a provider network, these ‘gaps’ in the model can be closed and corrected (e.g., through human annotation or other techniques).

Thus, instead of needing to build and install a particular ML framework chosen by a data scientist for training on target devices, learn how to use the ML framework, and write ML framework-specific code in their application to load the model, prepare input for inference, and run inference to get a prediction, developers can simply provide a trained model, send a simple request to deploy it to one or more electronic devices (which may or may not be homogeneous), and use a pre-defined API in their applications to perform inference.

FIG. 1 is a diagram illustrating an exemplary environment including a unified inference framework (“UIF”) server module 112 according to some embodiments. The UIF server module 112, in some embodiments, is a portion of software allowing users 118 to deploy and manage high-performance machine learning models 130 running on connected devices 122 in production. Users 118 (e.g., individuals, organizations, even OEMs) can import to—or train machine learning models in—a provider network 100 (“the cloud”), and reliably deploy these models to large numbers of devices 122 at the edge. Embodiments can—via a model optimizer 114 of a UIF server module 112 and/or UIF client module 124—automatically tune ML models for optimal performance across multiple underlying hardware platforms, resulting in improved prediction/inference speeds that allow sophisticated computer vision, audio, and anomaly detection models to run efficiently, even on low-power devices.

The UIF server module 112 may be software executed by one or multiple computing devices of a provider network 100 and may serve as part of an edge device management service 110 (e.g., a service of multiple services 102 provided by the provider network 100), though in some embodiments the UIF server module 112 may operate as part of another service 102 or as a component of another application. A provider network 100 provides users 118 with the ability to utilize one or more of a variety of types of computing-related resources such as compute resources (e.g., executing virtual machine (VM) instances and/or containers, executing batch jobs, executing code without provisioning servers), data/storage resources (e.g., object storage, block-level storage, data archival storage, databases and database tables, etc.), network-related resources (e.g., configuring virtual networks including groups of compute resources, content delivery networks (CDNs), Domain Name Service (DNS)), application resources (e.g., databases, application build/deployment services), access policies or roles, identity policies or roles, machine images, routers and other data processing resources, etc. These and other computing resources may be provided as services, such as a hardware virtualization service that can execute compute instances, a storage service that can store data objects, etc. The users 118 (or “customers”) of provider networks 100 may utilize one or more user accounts that are associated with a customer account, though these terms may be used somewhat interchangeably depending upon the context of use. Users may interact with a provider network 100 across one or more intermediate networks (e.g., the internet) via one or more interface(s), such as through use of application programming interface (API) calls, via a console implemented as a website or application, etc. The interface(s) may be part of, or serve as a front-end to, a control plane of the provider network 100 that includes “backend” services supporting and enabling the services that may be more directly offered to customers.

To provide these and other computing resource services, provider networks 100 often rely upon virtualization techniques. For example, virtualization technologies may be used to provide users the ability to control or utilize compute instances (e.g., a VM using a guest operating system (O/S) that operates using a hypervisor that may or may not further operate on top of an underlying host O/S, a container that may or may not operate in a VM, an instance that can execute on “bare metal” hardware without an underlying hypervisor), where one or multiple compute instances can be implemented using a single electronic device. Thus, a user may directly utilize a compute instance hosted by the provider network to perform a variety of computing tasks, or may indirectly utilize a compute instance by submitting code to be executed by the provider network, which in turn utilizes a compute instance to execute the code (typically without the user having any control of or knowledge of the underlying compute instance(s) involved).

According to some embodiments, a user 118 may utilize a client device (e.g., client device 120) such as a Personal Computer (PC), laptop, or other mobile device such as a smartphone, tablet, etc., to manage the deployment of a ML model to one or more edge devices 122A-122N.

Ones of the edge devices 122A-122N may be preconfigured (e.g., at or after manufacture time, such as by an OEM or other provider) with a UIF client module 124. For example, an edge device 122A may include firmware or other pre-installed software including the UIF client module 124. Alternatively, a user may cause the UIF client module 124 to be deployed to the one or more edge devices 122A-122N. For example, the user 118 may use client device 120 to “log in” to an edge device 122A (e.g., via SSH, telnet, web application, etc.) and issue commands to the device 122A to install software (including the UIF client module 124), etc., at circle (A1). As part of installing this software, the UIF client module 124 may contact an edge device management service 110 and provide (or acquire, as assigned by the UIF server module 112) an identifier of itself, an identifier of the user 118 (or user/customer account), etc.

The edge device management service 110 may comprise software executed by one or multiple server computing devices of a provider network 100 that allows users to send requests (e.g., web service application programming interface (API) calls) to deploy, update, and/or otherwise utilize computing devices (e.g., edge devices 122A-122N) that execute outside of the provider network 100.

The user 118 may also more directly register one or more edge devices 122A-122N with the edge device management service 110 of the provider network 100. As shown at circle (A2), the user 118 may cause the client device 120 to send one or more registration request message(s) to the edge device management service 110 to “register” these devices, which may associate the devices with an account of the user within the provider network 100. The registration request message(s) may include identifiers of these devices, such as a media access control (MAC) address of each device, descriptions/categorizations of these devices (e.g., a manufacturer, model name and/or number), network addresses (e.g., Internet Protocol (IP) addresses) for the device(s), public key information of each device, etc. This registration may cause the edge device management service 110 to send UIF client module 124 code (e.g., source code, packages, binaries, etc.) to the edge device(s) 122 as shown at circle (B) to be installed (or “provisioned”) at the one or more edge devices 122A-122N.

To manage and deploy ML models to the one or more edge devices 122A-122N, the user 118 may obtain and provide a ML model 108 in a variety of ways. For example, as shown by optional circle (1), a user 118 may utilize a client device 120 to interact with a machine learning (ML) service 104 (of a same provider network 100, different provider network, or otherwise implemented) to train a ML model 108, which may be stored in a storage service 116 of the provider network 100. These interactions may be implemented using web service calls (e.g., HyperText Transfer Protocol (HTTP) request messages) sent to one or more endpoints associated with the ML service(s) 104 and/or provider network 100, causing the ML service to train a ML model using a particular ML framework, training data, hyperparameters, etc. Additional detail describing exemplary systems and techniques for generating ML models is presented herein with regard to FIG. 6.

In this scenario, the user 118 may cause the client device 120 to send, at circle (2), a request to deploy the ML model(s) 108 to one or more edge devices 122A-122N. The request may identify a storage location where the model is located (e.g., a URL/URI where the model file or files are available, which may be within or outside of the provider network 100), as well as identify a particular edge device 122A, multiple individual edge devices 122A-122N, or a group of edge devices (e.g., “store_security_cameras”) to deploy the ML model 108 to. In the case of multiple edge devices or a group of edge devices, the devices may be homogeneous (and thus have homogeneous hardware computing resources) or heterogeneous (and thus have differing hardware computing resources). With the identifier of the model storage location, the edge device management service 110 may obtain the one or more files making up the ML model 108.
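For illustration, the deployment request of circle (2) might be expressed as follows; this is only a sketch, and the endpoint, field names, and locations shown are hypothetical rather than part of any published API:

    # Hypothetical sketch of a deployment request; the endpoint and all
    # field names are illustrative only.
    import json
    import urllib.request

    deploy_request = {
        "ModelLocation": "s3://example-bucket/models/my-model.tar.gz",  # where the trained ML model 108 is stored
        "Targets": ["store_security_cameras"],  # a device group, or a list of individual device identifiers
    }

    request = urllib.request.Request(
        "https://edge.example.com/DeployModel",  # hypothetical edge device management service endpoint
        data=json.dumps(deploy_request).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # response = urllib.request.urlopen(request)  # authentication/signing omitted for brevity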

As another example, the ML model 108 may be provided with the request by the client device 120 directly to the edge device management service 110 (e.g., as an uploaded set of files, via a form submission of a webpage or FTP upload, etc.). This may be beneficial when, for example, the user 118 has trained the ML model 108 using another system (outside of the provider network 100), and simply seeks to provide an existing model to the edge device management service 110.

In response to the request, the UIF server module 112 may perform one or more optimization tasks with a model optimizer 114A. The model optimizer 114A may be a software module executed by one or more computing devices and may perform various general and/or device-specific optimizations of the ML model 108.

For example, in some embodiments a translator 113A module (e.g., library, function, binary, etc.) of the model optimizer 114A translates the ML model 108 from a first format (as generated by a particular ML framework) into a “common” second format that is “unified” in that it can be run by all inference engines 132 of all edge devices 122. Generally speaking, ML models are generated in a format that is specific to a particular framework. For example, a model trained using the TensorFlow framework is created in a particular format, which needs to be run by a system that can accommodate that particular format. Thus, a device having another framework installed (e.g., MXNet, Caffe, PyTorch, Microsoft Cognitive Toolkit, etc.) will not be able to run the model, and the same is true for other models created by use of other frameworks. Moreover, many of these frameworks are so large and computationally intensive that they are not suited for deployment on often (comparatively) resource-constrained edge devices. Thus, by translating the ML model 108 into a common format, the model optimizer 114A can make the model “portable” in that it can be run at any/all of the one or more edge devices 122A-122N using a common inference engine 132 (e.g., via use of an inference library 134). This translation can be performed using tools and techniques known to those of skill in the art, which may include using a conversion library/module that identifies certain values (e.g., weights) in model files generated by a first framework and inserts them in a different format (or location) within files adherent to a different framework or format (e.g., a different framework's format, a standardized “generic” format such as the Open Neural Network eXchange format “ONNX”, etc.). As another example, the translator may create low-level machine code that is executable on multiple types of hardware backends. However, in some embodiments the translator module 113B may be implemented at a particular edge device 122A and thus the translation may occur there.
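As a concrete illustration of such a translation using publicly available tooling (one possible approach, not necessarily the one used by the translator 113A), a PyTorch model can be exported to the framework-neutral ONNX format:

    # Sketch: translating a framework-specific model into the ONNX format.
    # The model and input shape are placeholders chosen for illustration.
    import torch
    import torchvision

    model = torchvision.models.resnet18()  # an example framework-specific model
    model.eval()

    dummy_input = torch.randn(1, 3, 224, 224)  # example input the exporter traces through the model
    torch.onnx.export(model, dummy_input, "model.onnx")  # writes a framework-neutral model file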

In some embodiments, the (possibly translated) model can be provided to the one or more edge devices 122A-122N (e.g., as indicated by the deployment request) directly (via circle (3A)), or by storing the (at least partially) optimized model 129 (e.g., a translated model, a translated and partially optimized model, an optimized model, etc.) in a storage service 116 at circle (3B), where it can be obtained by the one or more edge devices 122A-122N as shown at circle (3C), e.g., via the one or more edge devices 122A-122N sending requests (e.g., web service requests) to download the optimized model 129 files.

However, in some embodiments the UIF server module 112 may also perform other model optimizations (e.g., in addition to, or alternatively to, performing the translation described above). For example, the UIF server module 112 may perform computational graph optimizations (that are hardware-agnostic) and/or hardware-specific optimizations.

In some embodiments, the model optimizer 114 may optimize the model by performing layer fusion or similar optimizations. As is known to those of skill in the art, some machine learning models (e.g., some types of neural networks) can be modified—e.g., layers with unused output can be eliminated to avoid unnecessary computation; certain layers (e.g., certain convolutional layers, batch norm layers, bias layers, and/or ReLU activation layers) can be combined or “fused” to form a single layer; and layers can be combined via layer aggregation, in which layers that take a same source tensor and apply the same operations with similar parameters are merged into a single larger layer for higher computational efficiency. As another example, in some embodiments the model optimizer 114 may optimize the model by performing quantization, where certain data types may be changed (e.g., floating point values can be changed to integers) to reduce inference computational latency (albeit at a potential expense of accuracy). Moreover, in some embodiments, the model optimizer 114 may optimize the model by analyzing the model and performing kernel fusion (e.g., similar to layer fusion, albeit one layer down in the stack, and thus different kernels for different operators can be fused together), etc. As an additional example, the optimizations may include customizing the model based on a device context of the edge device(s) 122, e.g., what specific processor(s) each has, what drivers each has, what graphics processing unit(s) each has, etc.
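To make the layer-fusion idea concrete, the well-known folding of a batch normalization layer into a preceding convolution can be expressed as a small weight transformation (a minimal sketch in NumPy, independent of any particular framework):

    import numpy as np

    def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
        # w: conv weights of shape (out_channels, in_channels, kh, kw)
        # b: conv bias of shape (out_channels,)
        # gamma, beta, mean, var: batch norm parameters, one value per output channel
        scale = gamma / np.sqrt(var + eps)        # per-channel rescaling applied by the batch norm
        w_fused = w * scale[:, None, None, None]  # fold the rescaling into each output channel's filter
        b_fused = (b - mean) * scale + beta       # fold the normalization shift into the bias
        return w_fused, b_fused                   # one conv layer now replaces conv + batch norm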

Thus, an optimized model 129—whether it is simply translated, optimized, or both translated and optimized for execution—is provided to the identified one or more edge devices 122A-122N. In some embodiments, as shown at circle (4A), the at least partially optimized model 129 may be further optimized by an on-device model optimizer 114B, which may perform any or all of the above-described optimizations and may also or alternatively perform optimizations that are hardware specific. For example, the model optimizer 114B may translate the model using a translator 113B, implement hardware-agnostic optimizations, or optimize the model for improved (or optimal) scheduling (of how to run computation). As an example, in some embodiments the model optimizer 114B may implement data-driven scheduling, which includes running inference (using the model) for some amount of time, and then backing out what the optimal schedule is from observing the data.

Thus, the (possibly optimized) model provided to each edge device 122 at (3A) or (3C) may be further optimized (at (4A)) or not further optimized (circle (4B)), and then provided to an inference engine 132 as optimized model 130. The optimized model 130 may be in a variety of different formats based on the particular implementation. As one example, the optimized model 130 can include a first file carrying a graph-based representation of the model (e.g., in JSON/XML) indicating the structure of a neural network, and include another file carrying the model weights. As another example, the optimized model 130 may include three Intermediate Representation (IR) files: a JSON file describing the optimized graph, a “params” file saving the values of model parameters, and a “so” file for the inference engine to run the model inference.
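The three-file layout described above resembles the artifacts produced by open-source deep learning compiler stacks such as TVM; a loading sequence might look like the following sketch (module paths and signatures vary across TVM releases, so this is illustrative rather than a definitive recipe):

    import tvm
    from tvm.contrib import graph_executor

    lib = tvm.runtime.load_module("model.so")      # compiled operator library for this hardware
    graph = open("model.json").read()              # JSON description of the optimized graph
    params = bytearray(open("model.params", "rb").read())

    module = graph_executor.create(graph, lib, tvm.cpu())  # bind the graph and library to a device
    module.load_params(params)                     # load the saved model parameter values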

The one or more edge devices 122A-122N may then operate as intended, e.g., by an application 127 capturing/creating input data via one or more sensors 128 (e.g., optical sensors, audio sensors, temperature sensors, humidity sensors, air pressure sensors, gas sensors, moisture sensors, water flow sensors, weight sensors, motion sensors, global positioning system (GPS) sensors, rotation/acceleration sensors, radio sensors, biological sensors (e.g., pulse sensors), fingerprint sensors, and the like). This input data, at circle (5), is provided to the inference engine 132, which at circle (6) can perform inference using the optimized model 130 and optionally logic of an inference library 134. The inference engine 132 itself, or another application 127 executed by the one or more edge devices 122A-122N, at optional circle (7) may perform actions based on the inference values (or when the inference values satisfy some condition, e.g., when they exceed a threshold value). For example, the corresponding input data (and/or previous or surrounding input data points, and/or the inference predictions themselves) may be stored in a non-volatile storage, sent in a network message to a storage service 116 in a same or different network (e.g., in provider network 100) for further analysis, etc. As another example, the input data or inferences could be sent to a machine learning service, sent to a data monitoring/logging service, stored in a database, sent to a serverless code execution service to be processed, etc., allowing users to take “local” inference results generated by edge devices and integrate these results into an overall application 127 in nearly any manner desired by the users.

Notably, in some embodiments the model optimizer 114B and/or inference engine 132 may utilize a common set of APIs (across disparate types of deployments) to optimize a model, load a model, and/or perform inference, allowing the application 127 to be easily written to interact with the model optimizer 114B and/or inference engine 132, and flexibly be deployed in a number of different hardware environments. For example, the UIF client module 124 may expose an “optimize” method to applications that allows the application to obtain a model (e.g., by downloading a translated and/or optimized model from the storage service(s) 116 or edge device management service 110) and have the local model optimizer 114B optimize the model. As another example, the UIF client module 124 may expose a “model” method (including an argument identifying a path or location of an optimized model) that creates/loads a model instance in the inference engine 132, and a “Model.doInference” method (with a parameter comprising the input data—e.g., an image—upon which the inference is to be performed).
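Putting these methods together, application code against this unified API might resemble the following sketch (the module name and exact signatures are hypothetical; the method names mirror those described in the preceding paragraph):

    import uif  # hypothetical library exposed by the UIF client module 124

    uif.optimize("downloaded_model/")      # invoke the local model optimizer 114B on an obtained model
    model = uif.model("optimized_model/")  # create/load a model instance in the inference engine 132

    image = open("frame.jpg", "rb").read() # input data, e.g., captured via a sensor 128
    prediction = model.doInference(image)  # perform inference via the common API

Because the same calls work regardless of the underlying hardware, this application code can be deployed unchanged across heterogeneous edge devices.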

For an abstracted view of certain aspects disclosed herein, FIG. 2 is a diagram illustrating an exemplary architecture for high-performance machine learning inference according to some embodiments. As shown, a number of framework-specific models 204 generated using a number of machine learning frameworks 202 (e.g., MXNet, TensorFlow, Caffe, etc.) can be provided to one or more model optimizers 114A/114B to be translated and optimized, resulting in an optimized model being provided to an inference engine 132 that can run the optimized model with input data 206 generated using hardware resources 126 (such as one or more sensors 128) to generate inferences 208. As alluded to herein, in some embodiments the UIF client module 124 may be a software stack provided to users, or pre-installed on certain edge devices, so users don't have to do anything other than just provision/register the device(s). In other embodiments, the users may instead self-provision and install the UIF client module 124 on the device(s) themselves.

FIG. 3 is a diagram illustrating an exemplary environment with complete or partial edge optimization according to some embodiments. As shown at circle (A1), in some embodiments a ML model 108 in a format pertaining to a particular framework can be provided to a UIF server module 112 that can perform translation of the model into another common format, which may be sent as model 129 at circle (B) to be optimized by each on-device model optimizer 114B and then used by the inference engine 132. Alternatively, the UIF server module 112 can perform translation of the model into another common format and also (partially) optimize the model, resulting in a translated and partially-optimized model 129 that may be sent at circle (B) to be optimized by each on-device model optimizer 114B and then used by the inference engine 132 as model 130. Alternatively, the UIF server module 112 may simply provide the ML model 108 at circle (A2) to be translated and possibly further optimized by each on-device model optimizer 114B and then used by the inference engine 132 as model 130.

As another example, FIG. 4 is a diagram illustrating an exemplary environment with complete or partial provider network optimization according to some embodiments. As shown at circle (A), in some embodiments a ML model 108 in a format pertaining to a particular framework can be provided to a UIF server module 112 that can perform translation (and optionally, partial optimization) of the model into another common format, which may be sent at circle (B) to optionally be further optimized by each on-device model optimizer 114B and then used by the inference engine 132. Alternatively, the UIF server module 112 can perform translation of the model into another common format and also partially or completely optimize the model, resulting in an optimized model that may be sent at circle (C) to be used by the inference engine 132 of each edge device without on-device optimizations.

FIG. 5 is a flow diagram illustrating operations of a method for implementing high-performance machine learning inference in edge devices according to some embodiments. Some or all of the operations 500 (or other processes described herein, or variations, and/or combinations thereof) are performed under the control of one or more computer systems configured with executable instructions and are implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware or combinations thereof. The code is stored on a computer-readable storage medium, for example, in the form of a computer program comprising instructions executable by one or more processors. The computer-readable storage medium is non-transitory. In some embodiments, one or more (or all) of the operations 500 are performed by computing devices within the provider network 100 of the other figures, and individual operations may be performed by the edge device management service 110, storage service 116, and/or ML service 104. Additionally, ones of the operations 500 may be performed by the UIF client module 124 of the other figures.

The operations 500 include, at block 505, receiving a request to deploy a machine learning (ML) model to one or more electronic devices outside of a provider network. The request identifies a location of the ML model, where the ML model is of a first format associated with a first ML framework. Block 505 may be performed, for example, at an endpoint of a provider network 100, or by an edge device management service 110. The request may also identify one or multiple of the one or more electronic devices. The one or more electronic devices may be edge devices, which can be electronic computing devices located outside of the provider network that typically are deployed in particular environments of interest—i.e., outside of racks of a data center, as server computing devices often are. For example, the edge device(s) may include smart cameras, smart speakers, smart displays, environmental/biological sensor devices (e.g., wearable devices), smart phones, tablets, and the like. The one or more electronic devices may have been previously provisioned and thus registered with the edge device management service 110 (or another service of the provider network). The one or more electronic devices may include homogeneous devices having the same or similar hardware/computing resources, or may include heterogeneous devices having different hardware/computing resources—e.g., different architectures, ISAs, chipsets, cores, processing units (e.g., CPUs, GPUs, FPGAs), memories, sensors, etc. The one or more electronic devices may have the UIF client module 124 installed.

At block 510, the operations 500 also include translating a first one or more files of the ML model in the first format into a second one or more files of a second format. Block 510 may be performed, for example, by a model optimizer 114A of a UIF server module 112 of the other figures. The first one or more files of the ML model in the first format may result from the ML model being trained using a first ML framework such as, for example, MXNet, Caffe, TensorFlow, PyTorch, etc. The second one or more files of a second format may or may not be associated with a publicly distributed ML framework—thus, although it could be a format associated with MXNet, Caffe, TensorFlow, PyTorch, etc., the second format may also be another format that is publicly available (e.g., ONNX) or is private, but runnable by inference engines 132 executing at the one or more electronic devices.

The operations 500 also include, at block 515, optimizing the second one or more files based on at least one characteristic of the one or more electronic devices. Block 515 may be performed by a UIF server module 112 or UIF client module 124, for example. The optimization may be hardware-agnostic (and thus create optimizations for all types of computing devices, such as via computational graph optimizations) or hardware-specific (and thus create optimizations for specific types or categories of computing devices, such as those sharing a common architecture or set of capabilities). The optimization may include removing neural network layers with unused output to avoid unnecessary computation, combining or fusing certain layers (e.g., certain convolutional, bias, and/or ReLU activation layers) to form a single layer, combining layers via layer aggregation, performing quantization where certain data types may be changed (e.g., floating point values can be changed to integers) to reduce inference computational latency, analyzing the model and performing kernel fusion (e.g., similar to layer fusion, albeit one layer down in the stack, and thus different kernels for different operators can be fused together), configuring improved (or optimal) scheduling for execution, etc.
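As one concrete illustration of the quantization referenced above, a symmetric linear mapping from 32-bit floating point weights to 8-bit integers can be sketched as follows (a simplified example; production quantizers are typically calibrated more carefully):

    import numpy as np

    def quantize_int8(weights):
        scale = np.abs(weights).max() / 127.0  # map the largest weight magnitude onto the int8 range
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale  # the scale factor is retained to rescale accumulated integer results

    # Integer arithmetic on `q` is cheaper on many processors; multiplying
    # results by `scale` approximately recovers the floating point values,
    # trading a small amount of accuracy for reduced inference latency.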

At block 520, the operations 500 also include causing the optimized second one or more files to be provided to an inference engine of each of the one or more electronic devices. Block 520 may be performed by the UIF server module 112 or edge device management service 110, where the optimized second one or more files are transmitted directly to the one or more electronic devices or to a client device (to be installed upon the one or more electronic devices by a user or application), or where the optimized second one or more files are placed in a storage location (e.g., of a storage service) where the one or more electronic devices will obtain these files. Block 520 may also be performed by a storage service 116, where the optimized second one or more files are transmitted to the one or more electronic devices, perhaps in response to a request for the files made by each of the one or more electronic devices.

FIG. 6 is a block diagram of an illustrative operating environment in which machine learning models are trained and hosted according to some embodiments. The operating environment includes end user devices 602 (e.g., client device 120 and/or edge device(s) 122), a model training system 620, a model hosting system 640, a training data store 660, a training metrics data store 665, a container data store 670, a training model data store 675, and a model prediction data store 680.

A machine learning service described herein may include one or more of these entities, such as the model hosting system 640, model training system 620, etc.

In some embodiments, users, by way of user devices 602, interact with the model training system 620 to provide data that causes the model training system 620 to train one or more machine learning models. A machine learning model, generally, may be thought of as one or more equations that are “trained” using a set of data. In some embodiments, the model training system 620 provides ML functionalities as a Web service, and thus messaging between user devices 602 and the model training system 620 (or provider network 100), and/or between components of the model training system 620 (or provider network 100), may utilize HTTP messages to transfer data in a machine-readable file format, such as eXtensible Markup Language (XML) or JavaScript Object Notation (JSON).

The user devices 602 can interact with the model training system 620 via frontend 629 of the model training system 620. For example, a user device 602 can provide a training request to the frontend 629 that includes a container image (or multiple container images, or an identifier of one or multiple locations where container images are stored), an indicator of input data (e.g., an address or location of input data), one or more hyperparameter values (e.g., values indicating how the algorithm will operate, how many algorithms to run in parallel, how many clusters into which to separate data, etc.), and/or information describing the computing machine on which to train a machine learning model (e.g., a graphics processing unit (GPU) instance type, a central processing unit (CPU) instance type, an amount of memory to allocate, a type of virtual machine instance to use for training, etc.).
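The shape of such a training request might be as follows (all field names and values are hypothetical, shown only to gather the elements listed above into one place):

    training_request = {
        "TrainingImage": "registry.example.com/my-algo:latest",      # container image, or its location
        "InputDataLocation": "s3://example-bucket/training-data/",   # indicator of input data
        "HyperParameters": {"epochs": "10", "num_clusters": "8"},    # values controlling how the algorithm runs
        "ResourceConfig": {                                          # computing machine on which to train
            "InstanceType": "gpu.xlarge",
            "MemoryGiB": 32,
        },
    }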

In some embodiments, the container image can include one or more layers, where each layer represents an executable instruction. Some or all of the executable instructions together represent an algorithm that defines a machine learning model. The executable instructions (e.g., the algorithm) can be written in any programming language (e.g., Python, Ruby, C++, Java, etc.). In some embodiments, the algorithm is pre-generated and obtained by a user, via the user device 602, from an algorithm repository (e.g., a network-accessible marketplace, a data store provided by a machine learning training service, etc.). In some embodiments, the algorithm is completely user-generated or partially user-generated (e.g., user-provided code modifies or configures existing algorithmic code).

In some embodiments, instead of providing a container image (or identifier thereof) in the training request, the user device 602 may provide, in the training request, an algorithm written in any programming language. The model training system 620 then packages the algorithm into a container (optionally with other code, such as a “base” ML algorithm supplemented with user-provided code) that is eventually loaded into a virtual machine instance 622 for training a machine learning model, as described in greater detail below. For example, a user, via a user device 602, may develop an algorithm/code using an application (e.g., an interactive web-based programming environment) and cause the algorithm/code to be provided—perhaps as part of a training request (or referenced in a training request)—to the model training system 620, where this algorithm/code may be containerized on its own or used together with an existing container having a machine learning framework, for example.

In some embodiments, instead of providing a container image in the training request, the user device 602 provides, in the training request, an indicator of a container image (e.g., an indication of an address or a location at which a container image is stored). For example, the container image can be stored in a container data store 670, and this container image may have been previously created/uploaded by the user. The model training system 620 can retrieve the container image from the indicated location and create a container using the retrieved container image. The container is then loaded into a virtual machine instance 622 for training a machine learning model, as described in greater detail below.

The model training system 620 can use the information provided by the user device 602 to train a machine learning model in one or more pre-established virtual machine instances 622 in some embodiments. In particular, the model training system 620 includes a single physical computing device or multiple physical computing devices that are interconnected using one or more computing networks (not shown), where the physical computing device(s) host one or more virtual machine instances 622. The model training system 620 can handle the acquisition and configuration of compute capacity (e.g., containers, instances, etc., which are described in greater detail below) based on the information describing the computing machine on which to train a machine learning model provided by the user device 602. The model training system 620 can then train machine learning models using the compute capacity, as is described in greater detail below. The model training system 620 can automatically scale up and down based on the volume of training requests received from user devices 602 via frontend 629, thereby relieving the user from the burden of having to worry about over-utilization (e.g., acquiring too little computing resources and suffering performance issues) or under-utilization (e.g., acquiring more computing resources than necessary to train the machine learning models, and thus overpaying).

In some embodiments, the virtual machine instances 622 are utilized to execute tasks. For example, such tasks can include training a machine learning model. As shown in FIG. 6, each virtual machine instance 622 includes an operating system (OS) 624, a language runtime 626, and one or more ML training containers 630. Generally, the ML training containers 630 are logical units created within a virtual machine instance using the resources available on that instance and can be utilized to isolate execution of a task from other processes (e.g., task executions) occurring in the instance. In some embodiments, the ML training containers 630 are formed from one or more container images and a top container layer. Each container image may further include one or more image layers, where each image layer represents an executable instruction. As described above, some or all of the executable instructions together represent an algorithm that defines a machine learning model. Changes made to the ML training containers 630 (e.g., creation of new files, modification of existing files, deletion of files, etc.) are stored in the top container layer. If a ML training container 630 is deleted, the top container layer is also deleted. However, the container image(s) that form a portion of the deleted ML training container 630 can remain unchanged. The ML training containers 630 can be implemented, for example, as Linux containers (LXC), Docker containers, and the like.

The ML training containers 630 may individually include a runtime 634, code 637, and dependencies 632 needed by the code 637 in some embodiments. The runtime 634 can be defined by one or more executable instructions that form at least a portion of a container image that is used to form the ML training container 630 (e.g., the executable instruction(s) in the container image that define the operating system and/or runtime to run in the container formed from the container image). The code 637 includes one or more executable instructions that form at least a portion of a container image that is used to form the ML training container 630. For example, the code 637 includes the executable instructions in the container image that represent an algorithm that defines a machine learning model, which may reference (or utilize) code or libraries from dependencies 632. The runtime 634 is configured to execute the code 637 in response to an instruction to begin machine learning model training. Execution of the code 637 results in the generation of model data, as described in greater detail below.

In some embodiments, the code 637 includes executable instructions that represent algorithms that define different machine learning models. For example, the code 637 includes one set of executable instructions that represent a first algorithm that defines a first machine learning model and a second set of executable instructions that represent a second algorithm that defines a second machine learning model. In some embodiments, the virtual machine instance 622 executes the code 637 and trains all of the machine learning models. In some embodiments, the virtual machine instance 622 executes the code 637, selecting one of the machine learning models to train. For example, the virtual machine instance 622 can identify a type of training data indicated by the training request and select a machine learning model to train (e.g., execute the executable instructions that represent an algorithm that defines the selected machine learning model) that corresponds with the identified type of training data.

In some embodiments, the runtime 634 is the same as the runtime 626 utilized by the virtual machine instance 622. In some embodiments, the runtime 634 is different than the runtime 626 utilized by the virtual machine instance 622.

In some embodiments, the model training system 620 uses one or more container images included in a training request (or a container image retrieved from the container data store 670 in response to a received training request) to create and initialize a ML training container 630 in a virtual machine instance 622. For example, the model training system 620 creates a ML training container 630 that includes the container image(s) and/or a top container layer.

Prior to beginning the training process, in some embodiments, the model training system 620 retrieves training data from the location indicated in the training request. For example, the location indicated in the training request can be a location in the training data store 660. Thus, the model training system 620 retrieves the training data from the indicated location in the training data store 660. In some embodiments, the model training system 620 does not retrieve the training data prior to beginning the training process. Rather, the model training system 620 streams the training data from the indicated location during the training process. For example, the model training system 620 can initially retrieve a portion of the training data and provide the retrieved portion to the virtual machine instance 622 training the machine learning model. Once the virtual machine instance 622 has applied and used the retrieved portion or once the virtual machine instance 622 is about to use all of the retrieved portion (e.g., a buffer storing the retrieved portion is nearly empty), then the model training system 620 can retrieve a second portion of the training data and provide the second retrieved portion to the virtual machine instance 622, and so on.
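A minimal sketch of this streaming behavior follows; the fetch function stands in for a hypothetical ranged read against the training data store:

    def stream_training_data(fetch_portion, portion_size=1 << 20):
        # Retrieve the training data a portion at a time, yielding each
        # portion to the training instance as the previous one is consumed.
        offset = 0
        while True:
            portion = fetch_portion(offset, portion_size)  # e.g., a ranged read from the data store
            if not portion:
                return
            yield portion
            offset += len(portion)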

To perform the machine learning model training, the virtual machine instance 622 executes code 637 stored in the ML training container 630 in some embodiments. For example, the code 637 includes some or all of the executable instructions that form the container image of the ML training container 630 initialized therein. Thus, the virtual machine instance 622 executes some or all of the executable instructions that form the container image of the ML training container 630 initialized therein to train a machine learning model. The virtual machine instance 622 executes some or all of the executable instructions according to the hyperparameter values included in the training request. As an illustrative example, the virtual machine instance 622 trains a machine learning model by identifying values for certain parameters (e.g., coefficients, weights, centroids, etc.). The identified values depend on hyperparameters that define how the training is performed. Thus, the virtual machine instance 622 can execute the executable instructions to initiate a machine learning model training process, where the training process is run using the hyperparameter values included in the training request. Execution of the executable instructions can include the virtual machine instance 622 applying the training data retrieved by the model training system 620 as input parameters to some or all of the instructions being executed.

In some embodiments, executing the executable instructions causes the virtual machine instance 622 (e.g., the ML training container 630) to generate model data. For example, the ML training container 630 generates model data and stores the model data in a file system of the ML training container 630. The model data includes characteristics of the machine learning model being trained, such as a number of layers in the machine learning model, hyperparameters of the machine learning model, coefficients of the machine learning model, weights of the machine learning model, and/or the like. In particular, the generated model data includes values for the characteristics that define a machine learning model being trained. In some embodiments, executing the executable instructions causes a modification to the ML training container 630 such that the model data is written to the top container layer of the ML training container 630 and/or the container image(s) that forms a portion of the ML training container 630 is modified to include the model data.

The virtual machine instance 622 (or the model training system 620 itself) pulls the generated model data from the ML training container 630 and stores the generated model data in the training model data store 675 in an entry associated with the virtual machine instance 622 and/or the machine learning model being trained. In some embodiments, the virtual machine instance 622 generates a single file that includes model data and stores the single file in the training model data store 675. In some embodiments, the virtual machine instance 622 generates multiple files during the course of training a machine learning model, where each file includes model data. In some embodiments, each model data file includes the same or different model data information (e.g., one file identifies the structure of an algorithm, another file includes a list of coefficients, etc.). The virtual machine instance 622 can package the multiple files into a single file once training is complete and store the single file in the training model data store 675. Alternatively, the virtual machine instance 622 stores the multiple files in the training model data store 675. The virtual machine instance 622 stores the file(s) in the training model data store 675 while the training process is ongoing and/or after the training process is complete.
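Packaging multiple model data files into a single file, as described above, can be as simple as creating an archive (the file names here are illustrative only):

    import tarfile

    with tarfile.open("model.tar.gz", "w:gz") as archive:
        archive.add("model-symbol.json")   # e.g., a file identifying the structure of the algorithm
        archive.add("model-0000.params")   # e.g., a file containing the list of coefficients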

In some embodiments, the virtual machine instance 622 regularly stores model data file(s) in the training model data store 675 as the training process is ongoing. Thus, model data file(s) can be stored in the training model data store 675 at different times during the training process. Each set of model data files corresponding to a particular time, or each set of model data files present in the training model data store 675 as of a particular time, could be checkpoints that represent different versions of a partially-trained machine learning model during different stages of the training process. Accordingly, before training is complete, a user, via the user device 602, can submit a deployment and/or execution request in a manner as described below to deploy and/or execute a version of a partially trained machine learning model (e.g., a machine learning model trained as of a certain stage in the training process). A version of a partially-trained machine learning model can be based on some or all of the model data files stored in the training model data store 675.

In some embodiments, a virtual machine instance 622 executes code 637 stored in a plurality of ML training containers 630. For example, the algorithm included in the container image can be in a format that allows for the parallelization of the training process. Thus, the model training system 620 can create multiple copies of the container image provided in a training request and cause the virtual machine instance 622 to load each container image copy in a separate ML training container 630. The virtual machine instance 622 can then execute, in parallel, the code 637 stored in the ML training containers 630. The virtual machine instance 622 can further provide configuration information to each ML training container 630 (e.g., information indicating that N ML training containers 630 are collectively training a machine learning model and that a particular ML training container 630 receiving the configuration information is ML training container 630 number X of N), which can be included in the resulting model data. By parallelizing the training process, the model training system 620 can significantly reduce the training time in some embodiments.

In some embodiments, a plurality of virtual machine instances 622 execute code 637 stored in a plurality of ML training containers 630. For example, the resources used to train a particular machine learning model can exceed the limitations of a single virtual machine instance 622. However, the algorithm included in the container image can be in a format that allows for the parallelization of the training process. Thus, the model training system 620 can create multiple copies of the container image provided in a training request, initialize multiple virtual machine instances 622, and cause each virtual machine instance 622 to load a container image copy in one or more separate ML training containers 630. The virtual machine instances 622 can then each execute the code 637 stored in the ML training containers 630 in parallel. The model training system 620 can further provide configuration information to each ML training container 630 via the virtual machine instances 622 (e.g., information indicating that N ML training containers 630 are collectively training a machine learning model and that a particular ML training container 630 receiving the configuration information is ML training container 630 number X of N, information indicating that M virtual machine instances 622 are collectively training a machine learning model and that a particular ML training container 630 receiving the configuration information is initialized in virtual machine instance 622 number Y of M, etc.), which can be included in the resulting model data. As described above, by parallelizing the training process, the model training system 620 can significantly reduce the training time in some embodiments.

In some embodiments, the model training system 620 includes a plurality of physical computing devices, and two or more of the physical computing devices host one or more virtual machine instances 622 that execute the code 637. Thus, the parallelization can occur over different physical computing devices in addition to over different virtual machine instances 622 and/or ML training containers 630.

In some embodiments, the model training system 620 includes a ML model evaluator 628. The ML model evaluator 628 can monitor virtual machine instances 622 as machine learning models are being trained, obtaining the generated model data and processing the obtained model data to generate model metrics. For example, the model metrics can include quality metrics, such as an error rate of the machine learning model being trained, a statistical distribution of the machine learning model being trained, a latency of the machine learning model being trained, a confidence level of the machine learning model being trained (e.g., a level of confidence that the accuracy of the machine learning model being trained is known), etc. The ML model evaluator 628 can obtain the model data for a machine learning model being trained and evaluation data from the training data store 660. The evaluation data is separate from the data used to train a machine learning model and includes both input data and expected outputs (e.g., known results), and thus the ML model evaluator 628 can define a machine learning model using the model data and execute the machine learning model by providing the input data as inputs to the machine learning model. The ML model evaluator 628 can then compare the outputs of the machine learning model to the expected outputs and determine one or more quality metrics of the machine learning model being trained based on the comparison (e.g., the error rate can be a difference or distance between the machine learning model outputs and the expected outputs).
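A sketch of one such quality metric, computing an error rate over held-out evaluation data, follows; the model interface shown is hypothetical:

    def error_rate(model, eval_inputs, expected_outputs):
        # Run the model being evaluated on inputs with known results and
        # count how often its output differs from the expected output.
        wrong = sum(
            1 for x, expected in zip(eval_inputs, expected_outputs)
            if model.predict(x) != expected  # model.predict is a hypothetical interface
        )
        return wrong / len(eval_inputs)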

The ML model evaluator 628 periodically generates model metrics during the training process and stores the model metrics in the training metrics data store 665 in some embodiments. While the machine learning model is being trained, a user, via the user device 602, can access and retrieve the model metrics from the training metrics data store 665. The user can then use the model metrics to determine whether to adjust the training process and/or to stop the training process. For example, the model metrics can indicate that the machine learning model is performing poorly (e.g., has an error rate above a threshold value, has a statistical distribution that is not an expected or desired distribution (e.g., not a binomial, Poisson, geometric, or normal (Gaussian) distribution, etc.), has an execution latency above a threshold value, or has a confidence level below a threshold value) and/or is performing progressively worse (e.g., the quality metric continues to worsen over time). In response, in some embodiments, the user, via the user device 602, can transmit a request to the model training system 620 to modify the machine learning model being trained (e.g., transmit a modification request). The request can include a new or modified container image, a new or modified algorithm, new or modified hyperparameter(s), and/or new or modified information describing the computing machine on which to train a machine learning model. The model training system 620 can modify the machine learning model accordingly. For example, the model training system 620 can cause the virtual machine instance 622 to optionally delete an existing ML training container 630, create and initialize a new ML training container 630 using some or all of the information included in the request, and execute the code 637 stored in the new ML training container 630 to restart the machine learning model training process. As another example, the model training system 620 can cause the virtual machine instance 622 to modify the execution of code stored in an existing ML training container 630 according to the data provided in the modification request. In some embodiments, the user, via the user device 602, can transmit a request to the model training system 620 to stop the machine learning model training process. The model training system 620 can then instruct the virtual machine instance 622 to delete the ML training container 630 and/or to delete any model data stored in the training model data store 675.
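The modification request described above might carry fields along these lines; the key names and values below are purely hypothetical stand-ins, not the service's actual request schema.

```python
# Hypothetical modification request; every key and value is illustrative.
modification_request = {
    "training_job_id": "example-job-123",          # hypothetical identifier
    "container_image": "registry/example:v2",      # new or modified image
    "algorithm": "example-algorithm",              # new or modified algorithm
    "hyperparameters": {"learning_rate": 1e-4},    # new or modified values
    "machine_description": {"instance_type": "gpu.large"},  # where to train
}
```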

As described below, in some embodiments, the model data stored in the training model data store 675 is used by the model hosting system 640 to deploy machine learning models. Alternatively or additionally, a user device 602 or another computing device (not shown) can retrieve the model data from the training model data store 675 to implement a learning algorithm in an external device. As an illustrative example, a robotic device can include sensors to capture input data. A user device 602 can retrieve the model data from the training model data store 675 and store the model data in the robotic device. The model data defines a machine learning model. Thus, the robotic device can provide the captured input data as an input to the machine learning model, resulting in an output. The robotic device can then perform an action (e.g., move forward, raise an arm, generate a sound, etc.) based on the resulting output.

While the virtual machine instances 622 are shown in FIG. 6 as a single grouping of virtual machine instances 622, some embodiments of the present application separate virtual machine instances 622 that are actively assigned to execute tasks from those virtual machine instances 622 that are not actively assigned to execute tasks. For example, those virtual machine instances 622 actively assigned to execute tasks are grouped into an “active pool,” while those virtual machine instances 622 not actively assigned to execute tasks are placed within a “warming pool.” In some embodiments, those virtual machine instances 622 within the warming pool can be pre-initialized with an operating system, language runtimes, and/or other software required to enable rapid execution of tasks (e.g., rapid initialization of machine learning model training in ML training container(s) 630) in response to training requests.
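A sketch, under assumed names, of how a warming pool shortens startup: a pre-initialized instance is promoted to the active pool instead of provisioning one from scratch. This is a simplification for illustration, not the actual pool-management logic.

```python
def provision_new_instance():
    # Placeholder for launching and fully initializing a fresh VM instance.
    return {"pre_initialized": False}

def acquire_instance(active_pool, warming_pool):
    # Prefer a pre-initialized instance so a task can start rapidly.
    instance = warming_pool.pop() if warming_pool else provision_new_instance()
    active_pool.append(instance)
    return instance
```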

In some embodiments, the model training system 620 includes a processing unit, a network interface, a computer-readable medium drive, and an input/output device interface, all of which can communicate with one another by way of a communication bus. The network interface can provide connectivity to one or more networks or computing systems. The processing unit can thus receive information and instructions from other computing systems or services (e.g., user devices 602, the model hosting system 640, etc.). The processing unit can also communicate to and from a memory of a virtual machine instance 622 and further provide output information for an optional display via the input/output device interface. The input/output device interface can also accept input from an optional input device. The memory can contain computer program instructions (grouped as modules in some embodiments) that the processing unit executes in order to implement one or more aspects of the present disclosure.

In some embodiments, the model hosting system 640 includes a single physical computing device or multiple physical computing devices that are interconnected using one or more computing networks (not shown), where the physical computing device(s) host one or more virtual machine instances 642. The model hosting system 640 can handle the acquisition and configuration of compute capacity (e.g., containers, instances, etc.) based on demand for the execution of trained machine learning models. The model hosting system 640 can then execute machine learning models using the compute capacity, as is described in greater detail below. The model hosting system 640 can automatically scale up and down based on the volume of execution requests received from user devices 602 via frontend 649 of the model hosting system 640, thereby relieving the user from the burden of having to worry about over-utilization (e.g., acquiring too few computing resources and suffering performance issues) or under-utilization (e.g., acquiring more computing resources than necessary to run the machine learning models, and thus overpaying).

In some embodiments, the virtual machine instances 642 are utilized to execute tasks. For example, such tasks can include executing a machine learning model. As shown in FIG. 6, each virtual machine instance 642 includes an operating system (OS) 644, a language runtime 646, and one or more ML scoring containers 650. The ML scoring containers 650 are similar to the ML training containers 630 in that the ML scoring containers 650 are logical units created within a virtual machine instance using the resources available on that instance and can be utilized to isolate execution of a task from other processes (e.g., task executions) occurring in the instance. In some embodiments, the ML scoring containers 650 are formed from one or more container images and a top container layer. Each container image further includes one or more image layers, where each image layer represents an executable instruction. As described above, some or all of the executable instructions together represent an algorithm that defines a machine learning model. Changes made to the ML scoring containers 650 (e.g., creation of new files, modification of existing files, deletion of files, etc.) are stored in the top container layer. If a ML scoring container 650 is deleted, the top container layer is also deleted. However, the container image(s) that form a portion of the deleted ML scoring container 650 can remain unchanged. The ML scoring containers 650 can be implemented, for example, as Linux containers.

The ML scoring containers 650 each include a runtime 654, code 656, and dependencies 652 (e.g., supporting software such as libraries) needed by the code 656 in some embodiments. The runtime 654 can be defined by one or more executable instructions that form at least a portion of a container image that is used to form the ML scoring container 650 (e.g., the executable instruction(s) in the container image that define the operating system and/or runtime to run in the container formed from the container image). The code 656 includes one or more executable instructions that form at least a portion of a container image that is used to form the ML scoring container 650. For example, the code 656 includes the executable instructions in the container image that represent an algorithm that defines a machine learning model, which may reference dependencies 652. The code 656 can also include model data that represent characteristics of the defined machine learning model, as described in greater detail below. The runtime 654 is configured to execute the code 656 in response to an instruction to begin execution of a machine learning model. Execution of the code 656 results in the generation of outputs (e.g., predicted results), as described in greater detail below.

In some embodiments, the runtime 654 is the same as the runtime 646 utilized by the virtual machine instance 642. In some embodiments, the runtime 654 is different from the runtime 646 utilized by the virtual machine instance 642.

In some embodiments, the model hosting system 640 uses one or more container images included in a deployment request (or a container image retrieved from the container data store 670 in response to a received deployment request) to create and initialize a ML scoring container 650 in a virtual machine instance 642. For example, the model hosting system 640 creates a ML scoring container 650 that includes the container image(s) and/or a top container layer.

As described above, a user device 602 can submit a deployment request and/or an execution request to the model hosting system 640 via the frontend 649 in some embodiments. A deployment request causes the model hosting system 640 to deploy a trained machine learning model into a virtual machine instance 642. For example, the deployment request can include an identification of an endpoint (e.g., an endpoint name, such as an HTTP endpoint name) and an identification of one or more trained machine learning models (e.g., a location of one or more model data files stored in the training model data store 675). Optionally, the deployment request also includes an identification of one or more container images stored in the container data store 670.
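An illustrative deployment request shaped as just described; the field names and locations below are assumptions for the sketch rather than the frontend 649's actual schema.

```python
# Hypothetical deployment request; keys and paths are illustrative only.
deployment_request = {
    "endpoint": "my-model-endpoint",                 # e.g., an HTTP endpoint name
    "model_data_locations": [                        # file(s) in the training model data store
        "training-model-data-store/model-a/model.tar.gz",
    ],
    "container_images": ["inference-image:latest"],  # optional identification
}
```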

Upon receiving the deployment request, the model hosting system 640 initializes one or more ML scoring containers 650 in one or more hosted virtual machine instances 642. In embodiments in which the deployment request includes an identification of one or more container images, the model hosting system 640 forms the ML scoring container(s) 650 from the identified container image(s). For example, a container image identified in a deployment request can be the same container image used to form an ML training container 630 used to train the machine learning model corresponding to the deployment request. Thus, the code 656 of the ML scoring container(s) 650 includes one or more executable instructions in the container image(s) that represent an algorithm that defines a machine learning model. In embodiments in which the deployment request does not include an identification of a container image, the model hosting system 640 forms the ML scoring container(s) 650 from one or more container images stored in the container data store 670 that are appropriate for executing the identified trained machine learning model(s). For example, an appropriate container image can be a container image that includes executable instructions that represent an algorithm that defines the identified trained machine learning model(s).

The model hosting system 640 further forms the ML scoring container(s) 650 by retrieving model data corresponding to the identified trained machine learning model(s) in some embodiments. For example, the deployment request can identify a location of model data file(s) stored in the training model data store 675. In embodiments in which a single model data file is identified in the deployment request, the model hosting system 640 retrieves the identified model data file from the training model data store 675 and inserts the model data file into a single ML scoring container 650, which forms a portion of code 656. In some embodiments, the model data file is archived or compressed (e.g., formed from a package of individual files). Thus, the model hosting system 640 unarchives or decompresses the model data file to obtain multiple individual files and inserts the individual files into the ML scoring container 650. In some embodiments, the model hosting system 640 stores the model data file in the same location as the location in which the model data file was stored in the ML training container 630 that generated the model data file. For example, the model data file initially was stored in the top container layer of the ML training container 630 at a certain offset, and the model hosting system 640 then stores the model data file in the top container layer of the ML scoring container 650 at the same offset.
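For illustration, a sketch of restoring model data into a scoring container's filesystem at the same path it occupied in the training container; the helper name, path handling, and tar-vs-single-file distinction are assumptions, not the hosting system's actual mechanism.

```python
# Hypothetical helper; names and layout are illustrative assumptions.
import os
import shutil
import tarfile

def install_model_data(archive_path: str, container_root: str, original_path: str):
    """Place model data at the same path it had in the ML training container."""
    dest = os.path.join(container_root, original_path.lstrip("/"))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if tarfile.is_tarfile(archive_path):
        # Unarchive the package into its multiple individual files.
        with tarfile.open(archive_path) as tar:
            tar.extractall(os.path.dirname(dest))
    else:
        # Single-file model data: copy it to the same location.
        shutil.copy(archive_path, dest)
```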

In embodiments in which multiple model data files are identified in the deployment request, the model hosting system 640 retrieves the identified model data files from the training model data store 675. The model hosting system 640 can insert the model data files into the same ML scoring container 650, into different ML scoring containers 650 initialized in the same virtual machine instance 642, or into different ML scoring containers 650 initialized in different virtual machine instances 642. As an illustrative example, the deployment request can identify multiple model data files corresponding to different trained machine learning models because the trained machine learning models are related (e.g., the output of one trained machine learning model is used as an input to another trained machine learning model). Thus, the user may desire to deploy multiple machine learning models to eventually receive a single output that relies on the outputs of multiple machine learning models.

In some embodiments, the model hosting system 640 associates the initialized ML scoring container(s) 650 with the endpoint identified in the deployment request. For example, each of the initialized ML scoring container(s) 650 can be associated with a network address. The model hosting system 640 can map the network address(es) to the identified endpoint, and the model hosting system 640 or another system (e.g., a routing system, not shown) can store the mapping. Thus, a user device 602 can refer to trained machine learning model(s) stored in the ML scoring container(s) 650 using the endpoint. This allows for the network address of an ML scoring container 650 to change without causing the user operating the user device 602 to change the way in which the user refers to a trained machine learning model.
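A minimal sketch of this endpoint mapping, with a plain dictionary standing in for the routing system; all names are illustrative assumptions.

```python
# Hypothetical routing table: endpoint name -> container network addresses.
endpoint_routes: dict = {}

def register_endpoint(endpoint, container_addresses):
    # Users address the endpoint; the backing addresses can change freely.
    endpoint_routes[endpoint] = list(container_addresses)

def resolve(endpoint):
    # Route a request to the ML scoring container(s) behind the endpoint.
    return endpoint_routes[endpoint]
```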

Once the ML scoring container(s) 650 are initialized, the ML scoring container(s) 650 are ready to execute trained machine learning model(s). In some embodiments, the user device 602 transmits an execution request to the model hosting system 640 via the frontend 649, where the execution request identifies an endpoint and includes an input to a machine learning model (e.g., a set of input data). The model hosting system 640 or another system (e.g., a routing system, not shown) can obtain the execution request, identify the ML scoring container(s) 650 corresponding to the identified endpoint, and route the input to the identified ML scoring container(s) 650.

In some embodiments, a virtual machine instance 642 executes the code 656 stored in an identified ML scoring container 650 in response to the model hosting system 640 receiving the execution request. In particular, execution of the code 656 causes the executable instructions in the code 656 corresponding to the algorithm to read the model data file stored in the ML scoring container 650, use the input included in the execution request as an input parameter, and generate a corresponding output. As an illustrative example, the algorithm can include coefficients, weights, layers, cluster centroids, and/or the like. The executable instructions in the code 656 corresponding to the algorithm can read the model data file to determine values for the coefficients, weights, layers, cluster centroids, and/or the like. The executable instructions can include input parameters, and the input included in the execution request can be supplied by the virtual machine instance 642 as the input parameters. With the machine learning model characteristics and the input parameters provided, execution of the executable instructions by the virtual machine instance 642 can be completed, resulting in an output.
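As a concrete illustration of this flow, the sketch below reads model characteristics (here, weights for a linear model) from a model data file and applies them to the request input. The JSON layout and the linear algorithm are assumptions chosen for brevity, not the actual code 656.

```python
import json

def run_inference(model_data_path, inputs):
    # Read characteristic values (weights) from the model data file.
    with open(model_data_path) as f:
        weights = json.load(f)["weights"]
    # Apply the algorithm to the input parameters to generate an output.
    return sum(w * x for w, x in zip(weights, inputs))
```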

In some embodiments, the virtual machine instance 642 stores the output in the model prediction data store 680. Alternatively or in addition, the virtual machine instance 642 transmits the output to the user device 602 that submitted the execution request via the frontend 649.

In some embodiments, the execution request corresponds to a group of related trained machine learning models. Thus, the ML scoring container 650 can transmit the output to a second ML scoring container 650 initialized in the same virtual machine instance 642 or in a different virtual machine instance 642. The virtual machine instance 642 that initialized the second ML scoring container 650 can then execute second code 656 stored in the second ML scoring container 650, providing the received output as an input parameter to the executable instructions in the second code 656. The second ML scoring container 650 further includes a model data file stored therein, which is read by the executable instructions in the second code 656 to determine values for the characteristics defining the machine learning model. Execution of the second code 656 results in a second output. The virtual machine instance 642 that initialized the second ML scoring container 650 can then transmit the second output to the model prediction data store 680 and/or the user device 602 via the frontend 649 (e.g., if no more trained machine learning models are needed to generate an output) or transmit the second output to a third ML scoring container 650 initialized in the same or different virtual machine instance 642 (e.g., if outputs from one or more additional trained machine learning models are needed), and the above-referenced process can be repeated with respect to the third ML scoring container 650.
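A sketch of chaining related models as described above, with each scoring container modeled as a plain callable whose output feeds the next; this simplification is for illustration only.

```python
def run_chain(scoring_containers, initial_input):
    # Pass each container's output to the next container as its input.
    output = initial_input
    for execute_container in scoring_containers:
        output = execute_container(output)
    return output  # final output for the prediction data store and/or the user

# Example: two stand-in "containers" composed into one result.
result = run_chain([lambda x: x * 2, lambda x: x + 1], 10)  # yields 21
```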

While the virtual machine instances 642 are shown in FIG. 6 as a single grouping of virtual machine instances 642, some embodiments of the present application separate virtual machine instances 642 that are actively assigned to execute tasks from those virtual machine instances 642 that are not actively assigned to execute tasks. For example, those virtual machine instances 642 actively assigned to execute tasks are grouped into an “active pool,” while those virtual machine instances 642 not actively assigned to execute tasks are placed within a “warming pool.” In some embodiments, those virtual machine instances 642 within the warming pool can be pre-initialized with an operating system, language runtimes, and/or other software required to enable rapid execution of tasks (e.g., rapid initialization of ML scoring container(s) 650, rapid execution of code 656 in ML scoring container(s), etc.) in response to deployment and/or execution requests.

In some embodiments, the model hosting system 640 includes a processing unit, a network interface, a computer-readable medium drive, and an input/output device interface, all of which can communicate with one another by way of a communication bus. The network interface can provide connectivity to one or more networks or computing systems. The processing unit can thus receive information and instructions from other computing systems or services (e.g., user devices 602, the model training system 620, etc.). The processing unit can also communicate to and from a memory of a virtual machine instance 642 and further provide output information for an optional display via the input/output device interface. The input/output device interface can also accept input from an optional input device. The memory can contain computer program instructions (grouped as modules in some embodiments) that the processing unit executes in order to implement one or more aspects of the present disclosure.

In some embodiments, the operating environment supports many different types of machine learning models, such as multi-armed bandit models, reinforcement learning models, ensemble machine learning models, deep learning models, and/or the like.

The model training system 620 and the model hosting system 640 depicted in FIG. 6 are not meant to be limiting. For example, the model training system 620 and/or the model hosting system 640 could also operate within a computing environment having a fewer or greater number of devices than are illustrated in FIG. 6. Thus, the depiction of the model training system 620 and/or the model hosting system 640 in FIG. 6 may be taken as illustrative and not limiting to the present disclosure. For example, the model training system 620 and/or the model hosting system 640 or various constituents thereof could implement various Web services components, hosted or “cloud” computing environments, and/or peer-to-peer network configurations to implement at least a portion of the processes described herein. In some embodiments, the model training system 620 and/or the model hosting system 640 are implemented directly in hardware or software executed by hardware devices and may, for instance, include one or more physical or virtual servers implemented on physical computer hardware configured to execute computer-executable instructions for performing the various features that are described herein. The one or more servers can be geographically dispersed or geographically co-located, for instance, in one or more points of presence (POPs) or regional data centers.

The frontend 629 processes all training requests received from user devices 602 and provisions virtual machine instances 622. In some embodiments, the frontend 629 serves as a front door to all the other services provided by the model training system 620. The frontend 629 processes the requests and makes sure that the requests are properly authorized. For example, the frontend 629 may determine whether the user associated with the training request is authorized to initiate the training process.

Similarly, frontend 649 processes all deployment and execution requests received from user devices 602 and provisions virtual machine instances 642. In some embodiments, the frontend 649 serves as a front door to all the other services provided by the model hosting system 640. The frontend 649 processes the requests and makes sure that the requests are properly authorized. For example, the frontend 649 may determine whether the user associated with a deployment request or an execution request is authorized to access the indicated model data and/or to execute the indicated machine learning model.
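Both frontends can be thought of as an authorization gate of roughly the following shape; the function names are stand-ins for the sketch, not the frontends' actual logic.

```python
def handle_request(user, request, is_authorized, dispatch):
    # Reject unauthorized training/deployment/execution requests up front.
    if not is_authorized(user, request):
        raise PermissionError("request is not properly authorized")
    return dispatch(request)  # hand off to the backing service
```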

The training data store 660 stores training data and/or evaluation data. The training data can be data used to train machine learning models and the evaluation data can be data used to evaluate the performance of machine learning models. In some embodiments, the training data and the evaluation data have common data. In some embodiments, the training data and the evaluation data do not have common data. In some embodiments, the training data includes input data and expected outputs. While the training data store 660 is depicted as being located external to the model training system 620 and the model hosting system 640, this is not meant to be limiting. For example, in some embodiments not shown, the training data store 660 is located internal to at least one of the model training system 620 or the model hosting system 640.
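Where the training data and evaluation data share no common data, the split can be as simple as the following sketch; the 90/10 ratio is an arbitrary assumption.

```python
def split_dataset(examples):
    # Hold out the tail of the dataset as evaluation data.
    cut = int(len(examples) * 0.9)
    return examples[:cut], examples[cut:]  # (training data, evaluation data)
```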

In some embodiments, the training metrics data store 665 stores model metrics. While the training metrics data store 665 is depicted as being located external to the model training system 620 and the model hosting system 640, this is not meant to be limiting. For example, in some embodiments not shown, the training metrics data store 665 is located internal to at least one of the model training system 620 or the model hosting system 640.

The container data store 670 stores container images, such as container images used to form ML training containers 630 and/or ML scoring containers 650, that can be retrieved by various virtual machine instances 622 and/or 642. While the container data store 670 is depicted as being located external to the model training system 620 and the model hosting system 640, this is not meant to be limiting. For example, in some embodiments not shown, the container data store 670 is located internal to at least one of the model training system 620 and the model hosting system 640.

The training model data store 675 stores model data files. In some embodiments, some of the model data files consist of a single file, while other model data files are packages of multiple individual files. While the training model data store 675 is depicted as being located external to the model training system 620 and the model hosting system 640, this is not meant to be limiting. For example, in some embodiments not shown, the training model data store 675 is located internal to at least one of the model training system 620 or the model hosting system 640.

The model prediction data store 680 stores outputs (e.g., execution results) generated by the ML scoring containers 650 in some embodiments. While the model prediction data store 680 is depicted as being located external to the model training system 620 and the model hosting system 640, this is not meant to be limiting. For example, in some embodiments not shown, the model prediction data store 680 is located internal to at least one of the model training system 620 and the model hosting system 640.

While the model training system 620, the model hosting system 640, the training data store 660, the training metrics data store 665, the container data store 670, the training model data store 675, and the model prediction data store 680 are illustrated as separate components, this is not meant to be limiting. In some embodiments, any one or all of these components can be combined to perform the functionality described herein. For example, any one or all of these components can be implemented by a single computing device, or by multiple distinct computing devices, such as computer servers, logically or physically grouped together to collectively operate as a server system. Any one or all of these components can communicate via a shared internal network, and the collective system (e.g., also referred to herein as a machine learning service) can communicate with one or more of the user devices 602 via the one or more network(s) 106.

Various example user devices 602 are shown in FIG. 6, including a desktop computer, laptop, and a mobile phone, each provided by way of illustration. In general, the user devices 602 can be any computing device such as a desktop, laptop or tablet computer, personal computer, wearable computer, server, personal digital assistant (PDA), hybrid PDA/mobile phone, mobile phone, electronic book reader, set-top box, voice command device, camera, digital media player, and the like. In some embodiments, the model training system 620 and/or the model hosting system 640 provides the user devices 602 with one or more user interfaces, command-line interfaces (CLI), application programming interfaces (API), and/or other programmatic interfaces for submitting training requests, deployment requests, and/or execution requests. In some embodiments, the user devices 602 can execute a stand-alone application that interacts with the model training system 620 and/or the model hosting system 640 for submitting training requests, deployment requests, and/or execution requests.

In some embodiments, the network 106 includes any wired network, wireless network, or combination thereof. For example, the network 106 may be a personal area network, local area network, wide area network, over-the-air broadcast network (e.g., for radio or television), cable network, satellite network, cellular telephone network, or combination thereof. As a further example, the network 106 may be a publicly accessible network of linked networks, possibly operated by various distinct parties, such as the Internet. In some embodiments, the network 106 may be a private or semi-private network, such as a corporate or university intranet. The network 106 may include one or more wireless networks, such as a Global System for Mobile Communications (GSM) network, a Code Division Multiple Access (CDMA) network, a Long Term Evolution (LTE) network, or any other type of wireless network. The network 106 can use protocols and components for communicating via the Internet or any of the other aforementioned types of networks. For example, the protocols used by the network 106 may include HTTP, HTTP Secure (HTTPS), Message Queue Telemetry Transport (MQTT), Constrained Application Protocol (CoAP), and the like. Protocols and components for communicating via the Internet or any of the other aforementioned types of communication networks are well known to those skilled in the art and, thus, are not described in more detail herein.

FIG. 7 illustrates an example provider network (or “service provider system”) environment according to some embodiments. A provider network 700 may provide resource virtualization to customers via one or more virtualization services 710 that allow customers to purchase, rent, or otherwise obtain instances 712 of virtualized resources, including but not limited to computation and storage resources, implemented on devices within the provider network or networks in one or more data centers. Local Internet Protocol (IP) addresses 716 may be associated with the resource instances 712; the local IP addresses are the internal network addresses of the resource instances 712 on the provider network 700. In some embodiments, the provider network 700 may also provide public IP addresses 714 and/or public IP address ranges (e.g., Internet Protocol version 4 (IPv4) or Internet Protocol version 6 (IPv6) addresses) that customers may obtain from the provider network 700.

Conventionally, the provider network 700, via the virtualization services 710, may allow a customer of the service provider (e.g., a customer that operates one or more client networks 750A-750C including one or more customer device(s) 752) to dynamically associate at least some public IP addresses 714 assigned or allocated to the customer with particular resource instances 712 assigned to the customer. The provider network 700 may also allow the customer to remap a public IP address 714, previously mapped to one virtualized computing resource instance 712 allocated to the customer, to another virtualized computing resource instance 712 that is also allocated to the customer. Using the virtualized computing resource instances 712 and public IP addresses 714 provided by the service provider, a customer of the service provider such as the operator of customer network(s) 750A-750C may, for example, implement customer-specific applications and present the customer's applications on an intermediate network 740, such as the Internet. Other network entities 720 on the intermediate network 740 may then generate traffic to a destination public IP address 714 published by the customer network(s) 750A-750C; the traffic is routed to the service provider data center, and at the data center is routed, via a network substrate, to the local IP address 716 of the virtualized computing resource instance 712 currently mapped to the destination public IP address 714. Similarly, response traffic from the virtualized computing resource instance 712 may be routed via the network substrate back onto the intermediate network 740 to the source entity 720.

Local IP addresses, as used herein, refer to the internal or “private” network addresses, for example, of resource instances in a provider network. Local IP addresses can be within address blocks reserved by Internet Engineering Task Force (IETF) Request for Comments (RFC) 1918 and/or of an address format specified by IETF RFC 4193 and may be mutable within the provider network. Network traffic originating outside the provider network is not directly routed to local IP addresses; instead, the traffic uses public IP addresses that are mapped to the local IP addresses of the resource instances. The provider network may include networking devices or appliances that provide network address translation (NAT) or similar functionality to perform the mapping from public IP addresses to local IP addresses and vice versa.

Public IP addresses are Internet mutable network addresses that are assigned to resource instances, either by the service provider or by the customer. Traffic routed to a public IP address is translated, for example via 1:1 NAT, and forwarded to the respective local IP address of a resource instance.
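A toy sketch of 1:1 NAT as just described; the mapping table and packet shape are illustrative, and the example addresses come from the documentation (RFC 5737) and private (RFC 1918) ranges.

```python
# Example 1:1 mapping: public address -> local address of a resource instance.
public_to_local = {"203.0.113.7": "10.0.0.5"}

def translate_inbound(packet):
    # Rewrite the destination from the public IP to the mapped local IP.
    packet["dst"] = public_to_local[packet["dst"]]
    return packet
```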

Some public IP addresses may be assigned by the provider network infrastructure to particular resource instances; these public IP addresses may be referred to as standard public IP addresses, or simply standard IP addresses. In some embodiments, the mapping of a standard IP address to a local IP address of a resource instance is the default launch configuration for all resource instance types.

At least some public IP addresses may be allocated to or obtained by customers of the provider network 700; a customer may then assign their allocated public IP addresses to particular resource instances allocated to the customer. These public IP addresses may be referred to as customer public IP addresses, or simply customer IP addresses. Instead of being assigned by the provider network 700 to resource instances as in the case of standard IP addresses, customer IP addresses may be assigned to resource instances by the customers, for example via an API provided by the service provider. Unlike standard IP addresses, customer IP addresses are allocated to customer accounts and can be remapped to other resource instances by the respective customers as necessary or desired. A customer IP address is associated with a customer's account, not a particular resource instance, and the customer controls that IP address until the customer chooses to release it. Unlike conventional static IP addresses, customer IP addresses allow the customer to mask resource instance or availability zone failures by remapping the customer's public IP addresses to any resource instance associated with the customer's account. The customer IP addresses, for example, enable a customer to engineer around problems with the customer's resource instances or software by remapping customer IP addresses to replacement resource instances.

FIG. 8 is a block diagram of an example provider network that provides a storage service and a hardware virtualization service to customers, according to some embodiments. Hardware virtualization service 820 provides multiple computation resources 824 (e.g., VMs) to customers. The computation resources 824 may, for example, be rented or leased to customers of the provider network 800 (e.g., to a customer that implements customer network 850). Each computation resource 824 may be provided with one or more local IP addresses. Provider network 800 may be configured to route packets from the local IP addresses of the computation resources 824 to public Internet destinations, and from public Internet sources to the local IP addresses of computation resources 824.

Provider network 800 may provide a customer network 850, for example coupled to intermediate network 840 via local network 856, the ability to implement virtual computing systems 892 via hardware virtualization service 820 coupled to intermediate network 840 and to provider network 800. In some embodiments, hardware virtualization service 820 may provide one or more APIs 802, for example a web services interface, via which a customer network 850 may access functionality provided by the hardware virtualization service 820, for example via a console 894 (e.g., a web-based application, standalone application, mobile application, etc.). In some embodiments, at the provider network 800, each virtual computing system 892 at customer network 850 may correspond to a computation resource 824 that is leased, rented, or otherwise provided to customer network 850.

From an instance of a virtual computing system 892 and/or another customer device 890 (e.g., via console 894), the customer may access the functionality of storage service 810, for example via one or more APIs 802, to access data from and store data to storage resources 818A-818N of a virtual data store 816 (e.g., a folder or “bucket”, a virtualized volume, a database, etc.) provided by the provider network 800. In some embodiments, a virtualized data store gateway (not shown) may be provided at the customer network 850 that may locally cache at least some data, for example frequently-accessed or critical data, and that may communicate with storage service 810 via one or more communications channels to upload new or modified data from a local cache so that the primary store of data (virtualized data store 816) is maintained. In some embodiments, a user, via a virtual computing system 892 and/or another customer device 890, may mount and access virtual data store 816 volumes via storage service 810 acting as a storage virtualization service, and these volumes may appear to the user as local (virtualized) storage 898.

While not shown in FIG. 8, the virtualization service(s) may also be accessed from resource instances within the provider network 800 via API(s) 802. For example, a customer, appliance service provider, or other entity may access a virtualization service from within a respective virtual network on the provider network 800 via an API 802 to request allocation of one or more resource instances within the virtual network or within another virtual network.

Illustrative System

In some embodiments, a system that implements a portion or all of the techniques for high-performance machine learning inference in heterogeneous edge devices as described herein may include a general-purpose computer system that includes or is configured to access one or more computer-accessible media, such as computer system 900 illustrated in FIG. 9. In the illustrated embodiment, computer system 900 includes one or more processors 910 coupled to a system memory 920 via an input/output (I/O) interface 930. Computer system 900 further includes a network interface 940 coupled to I/O interface 930. While FIG. 9 shows computer system 900 as a single computing device, in various embodiments a computer system 900 may include one computing device or any number of computing devices configured to work together as a single computer system 900.

In various embodiments, computer system 900 may be a uniprocessor system including one processor 910, or a multiprocessor system including several processors 910 (e.g., two, four, eight, or another suitable number). Processors 910 may be any suitable processors capable of executing instructions. For example, in various embodiments, processors 910 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs), such as the x86, ARM, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In multiprocessor systems, each of processors 910 may commonly, but not necessarily, implement the same ISA.

System memory 920 may store instructions and data accessible by processor(s) 910. In various embodiments, system memory 920 may be implemented using any suitable memory technology, such as random-access memory (RAM), static RAM (SRAM), synchronous dynamic RAM (SDRAM), nonvolatile/Flash-type memory, or any other type of memory. In the illustrated embodiment, program instructions and data implementing one or more desired functions, such as those methods, techniques, and data described above, are shown stored within system memory 920 as code 925 and data 926.

In one embodiment, I/O interface 930 may be configured to coordinate I/O traffic between processor 910, system memory 920, and any peripheral devices in the device, including network interface 940 or other peripheral interfaces. In some embodiments, I/O interface 930 may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory 920) into a format suitable for use by another component (e.g., processor 910). In some embodiments, I/O interface 930 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. In some embodiments, the function of I/O interface 930 may be split into two or more separate components, such as a north bridge and a south bridge, for example. Also, in some embodiments some or all of the functionality of I/O interface 930, such as an interface to system memory 920, may be incorporated directly into processor 910.

Network interface 940 may be configured to allow data to be exchanged between computer system 900 and other devices 960 attached to a network or networks 950, such as other computer systems or devices as illustrated in FIG. 1, for example. In various embodiments, network interface 940 may support communication via any suitable wired or wireless general data networks, such as types of Ethernet network, for example. Additionally, network interface 940 may support communication via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks, via storage area networks (SANs) such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.

In some embodiments, a computer system 900 includes one or more offload cards 970 (including one or more processors 975, and possibly including the one or more network interfaces 940) that are connected using an I/O interface 930 (e.g., a bus implementing a version of the Peripheral Component Interconnect-Express (PCI-E) standard, or another interconnect such as a QuickPath interconnect (QPI) or UltraPath interconnect (UPI)). For example, in some embodiments the computer system 900 may act as a host electronic device (e.g., operating as part of a hardware virtualization service) that hosts compute instances, and the one or more offload cards 970 execute a virtualization manager that can manage compute instances that execute on the host electronic device. As an example, in some embodiments the offload card(s) 970 can perform compute instance management operations such as pausing and/or un-pausing compute instances, launching and/or terminating compute instances, performing memory transfer/copying operations, etc. These management operations may, in some embodiments, be performed by the offload card(s) 970 in coordination with a hypervisor (e.g., upon a request from a hypervisor) that is executed by the other processors 910A-910N of the computer system 900. However, in some embodiments the virtualization manager implemented by the offload card(s) 970 can accommodate requests from other entities (e.g., from compute instances themselves), and may not coordinate with (or service) any separate hypervisor.

In some embodiments, system memory 920 may be one embodiment of a computer-accessible medium configured to store program instructions and data as described above. However, in other embodiments, program instructions and/or data may be received, sent or stored upon different types of computer-accessible media. Generally speaking, a computer-accessible medium may include non-transitory storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD coupled to computer system 900 via I/O interface 930. A non-transitory computer-accessible storage medium may also include any volatile or non-volatile media such as RAM (e.g., SDRAM, double data rate (DDR) SDRAM, SRAM, etc.), read only memory (ROM), etc., that may be included in some embodiments of computer system 900 as system memory 920 or another type of memory. Further, a computer-accessible medium may include transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link, such as may be implemented via network interface 940.

As discussed herein, different approaches can be implemented in various environments in accordance with the described embodiments. For example, FIG. 10 illustrates an example of an environment 1000 for implementing aspects in accordance with various embodiments. For example, in some embodiments the request messages described are HTTP requests that are received by a web server (e.g., web server 1006), and the users, via electronic devices, may interact with the provider network via a web portal provided via the web server 1006 and application server 1008. As will be appreciated, although a web-based environment is used for purposes of explanation, different environments may be used, as appropriate, to implement various embodiments. The system includes an electronic client device 1002, which may also be referred to as a client device and can be any appropriate device operable to send and receive requests, messages or information over an appropriate network 1004 and convey information back to a user of the device 1002. Examples of such client devices include personal computers (PCs), cell phones, handheld messaging devices, laptop computers, set-top boxes, personal data assistants, electronic book readers, wearable electronic devices (e.g., glasses, wristbands, monitors), and the like. The one or more networks 1004 can include any appropriate network, including an intranet, the Internet, a cellular network, a local area network, or any other such network or combination thereof. Components used for such a system can depend at least in part upon the type of network and/or environment selected. Protocols and components for communicating via such a network are well known and will not be discussed herein in detail. Communication over the network can be enabled via wired or wireless connections and combinations thereof. In this example, the network 1004 includes the Internet, as the environment includes a web server 1006 for receiving requests and serving content in response thereto, although for other networks an alternative device serving a similar purpose could be used, as would be apparent to one of ordinary skill in the art.

The illustrative environment includes at least one application server 1008 and a data store 1010. It should be understood that there can be several application servers, layers, or other elements, processes or components, which may be chained or otherwise configured, which can interact to perform tasks such as obtaining data from an appropriate data store. As used herein the term “data store” refers to any device or combination of devices capable of storing, accessing and retrieving data, which may include any combination and number of data servers, databases, data storage devices and data storage media, in any standard, distributed or clustered environment. The application server 1008 can include any appropriate hardware and software for integrating with the data store 1010 as needed to execute aspects of one or more applications for the client device 1002 and handling a majority of the data access and business logic for an application. The application server 1008 provides access control services in cooperation with the data store 1010 and is able to generate content such as text, graphics, audio, video, etc., to be transferred to the client device 1002, which may be served to the user by the web server in the form of HyperText Markup Language (HTML), Extensible Markup Language (XML), JavaScript Object Notation (JSON), or another appropriate unstructured or structured language in this example. The handling of all requests and responses, as well as the delivery of content between the client device 1002 and the application server 1008, can be handled by the web server 1006. It should be understood that the web server 1006 and application server 1008 are not required and are merely example components, as structured code discussed herein can be executed on any appropriate device or host machine as discussed elsewhere herein.

The data store 1010 can include several separate data tables, databases, or other data storage mechanisms and media for storing data relating to a particular aspect. For example, the data store illustrated includes mechanisms for storing production data 1012 and user information 1016, which can be used to serve content for the production side. The data store 1010 also is shown to include a mechanism for storing log or session data 1014. It should be understood that there can be many other aspects that may need to be stored in the data store, such as page image information and access rights information, which can be stored in any of the above listed mechanisms as appropriate or in additional mechanisms in the data store 1010. The data store 1010 is operable, through logic associated therewith, to receive instructions from the application server 1008 and obtain, update, or otherwise process data in response thereto. In one example, a user might submit a search request for a certain type of item. In this case, the data store 1010 might access the user information 1016 to verify the identity of the user and can access the production data 1012 to obtain information about items of that type. The information can then be returned to the user, such as in a listing of results on a web page that the user is able to view via a browser on the user device 1002. Information for a particular item of interest can be viewed in a dedicated page or window of the browser.

The web server 1006, application server 1008, and/or data store 1010 may be implemented by one or more electronic devices 1020, which can also be referred to as electronic server devices or server end stations, and may or may not be located in different geographic locations. Each of the one or more electronic devices 1020 may include an operating system that provides executable program instructions for the general administration and operation of that device and typically will include a computer-readable medium storing instructions that, when executed by a processor of the device, allow the device to perform its intended functions. Suitable implementations for the operating system and general functionality of the devices are known or commercially available and are readily implemented by persons having ordinary skill in the art, particularly in light of the disclosure herein.

The environment may be a distributed computing environment utilizing several computer systems and components that are interconnected via communication links, using one or more computer networks or direct connections. However, it will be appreciated by those of ordinary skill in the art that such a system could operate equally well in a system having fewer or a greater number of components than are illustrated in FIG. 10. Thus, the depiction of the environment 1000 in FIG. 10 should be taken as being illustrative in nature and not limiting to the scope of the disclosure.

Various embodiments discussed or suggested herein can be implemented in a wide variety of operating environments, which in some cases can include one or more user computers, computing devices, or processing devices which can be used to operate any of a number of applications. User or client devices can include any of a number of general purpose personal computers, such as desktop or laptop computers running a standard operating system, as well as cellular, wireless, and handheld devices running mobile software and capable of supporting a number of networking and messaging protocols. Such a system also can include a number of workstations running any of a variety of commercially-available operating systems and other known applications for purposes such as development and database management. These devices also can include other electronic devices, such as dummy terminals, thin-clients, gaming systems, and/or other devices capable of communicating via a network.

Most embodiments utilize at least one network that would be familiar to those skilled in the art for supporting communications using any of a variety of commercially-available protocols, such as Transmission Control Protocol/Internet Protocol (TCP/IP), File Transfer Protocol (FTP), Universal Plug and Play (UPnP), Network File System (NFS), Common Internet File System (CIFS), Extensible Messaging and Presence Protocol (XMPP), AppleTalk, etc. The network(s) can include, for example, a local area network (LAN), a wide-area network (WAN), a virtual private network (VPN), the Internet, an intranet, an extranet, a public switched telephone network (PSTN), an infrared network, a wireless network, and any combination thereof.

In embodiments utilizing a web server, the web server can run any of a variety of server or mid-tier applications, including HTTP servers, File Transfer Protocol (FTP) servers, Common Gateway Interface (CGI) servers, data servers, Java servers, business application servers, etc. The server(s) also may be capable of executing programs or scripts in response to requests from user devices, such as by executing one or more Web applications that may be implemented as one or more scripts or programs written in any programming language, such as Java®, C, C# or C++, or any scripting language, such as Perl, Python, PHP, or TCL, as well as combinations thereof. The server(s) may also include database servers, including without limitation those commercially available from Oracle®, Microsoft®, Sybase®, IBM®, etc. The database servers may be relational or non-relational (e.g., “NoSQL”), distributed or non-distributed, etc.

The environment can include a variety of data stores and other memory and storage media as discussed above. These can reside in a variety of locations, such as on a storage medium local to (and/or resident in) one or more of the computers or remote from any or all of the computers across the network. In a particular set of embodiments, the information may reside in a storage-area network (SAN) familiar to those skilled in the art. Similarly, any necessary files for performing the functions attributed to the computers, servers, or other network devices may be stored locally and/or remotely, as appropriate. Where a system includes computerized devices, each such device can include hardware elements that may be electrically coupled via a bus, the elements including, for example, at least one central processing unit (CPU), at least one input device (e.g., a mouse, keyboard, controller, touch screen, or keypad), and/or at least one output device (e.g., a display device, printer, or speaker). Such a system may also include one or more storage devices, such as disk drives, optical storage devices, and solid-state storage devices such as random-access memory (RAM) or read-only memory (ROM), as well as removable media devices, memory cards, flash cards, etc.

Such devices also can include a computer-readable storage media reader, a communications device (e.g., a modem, a network card (wireless or wired), an infrared communication device, etc.), and working memory as described above. The computer-readable storage media reader can be connected with, or configured to receive, a computer-readable storage medium, representing remote, local, fixed, and/or removable storage devices as well as storage media for temporarily and/or more permanently containing, storing, transmitting, and retrieving computer-readable information. The system and various devices also typically will include a number of software applications, modules, services, or other elements located within at least one working memory device, including an operating system and application programs, such as a client application or web browser. It should be appreciated that alternate embodiments may have numerous variations from that described above. For example, customized hardware might also be used and/or particular elements might be implemented in hardware, software (including portable software, such as applets), or both. Further, connection to other computing devices such as network input/output devices may be employed.

Storage media and computer readable media for containing code, or portions of code, can include any appropriate media known or used in the art, including storage media and communication media, such as but not limited to volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage and/or transmission of information such as computer readable instructions, data structures, program modules, or other data, including RAM, ROM, Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc-Read Only Memory (CD-ROM), Digital Versatile Disk (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a system device. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will appreciate other ways and/or methods to implement the various embodiments.

In the preceding description, various embodiments are described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.

Bracketed text and blocks with dashed borders (e.g., large dashes, small dashes, dot-dash, and dots) are used herein to illustrate optional operations that add additional features to some embodiments. However, such notation should not be taken to mean that these are the only options or optional operations, and/or that blocks with solid borders are not optional in certain embodiments.

Reference numerals with suffix letters (e.g., 122A-122N) may be used to indicate that there can be one or multiple instances of the referenced entity in various embodiments, and when there are multiple instances, each does not need to be identical but may instead share some general traits or act in common ways. Further, the particular suffixes used are not meant to imply that a particular amount of the entity exists unless specifically indicated to the contrary. Thus, two entities using the same or different suffix letters may or may not have the same number of instances in various embodiments.

References to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

Moreover, in the various embodiments described above, unless specifically noted otherwise, disjunctive language such as the phrase “at least one of A, B, or C” is intended to be understood to mean either A, B, or C, or any combination thereof (e.g., A, B, and/or C). As such, disjunctive language is not intended to, nor should it be understood to, imply that a given embodiment requires at least one of A, at least one of B, or at least one of C to each be present.

The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the disclosure as set forth in the claims.

What is claimed is:
1. A computer-implemented method comprising:
receiving, at a web service endpoint of a provider network, a request to deploy a machine learning (ML) model to one or more electronic devices located outside of the provider network, the request identifying the ML model or a location of the ML model, the request further including one or more identifiers of the one or more electronic devices, the ML model being of a first format generated by a first ML framework;
obtaining a first one or more files of the ML model based on the request;
translating the first one or more files of the first format into a second one or more files of a second format;
optimizing the second one or more files for improved execution at the one or more electronic devices based at least in part on at least one hardware characteristic of the one or more electronic devices, the optimizing including performing one or more of layer fusion, quantization, optimal scheduling, or kernel fusion;
providing the optimized second one or more files to an inference engine of each of the one or more electronic devices to utilize the optimized second one or more files to generate inferences; and
receiving, at the web service endpoint, a second request to deploy a second ML model to a second one or more electronic devices located outside of the provider network, the second request identifying the second ML model or a second location of the second ML model, the second ML model being of a third format generated by a second ML framework that is different from the first ML framework.
2. The computer-implemented method of claim 1, further comprising:
obtaining a third one or more files of the second ML model based on the second request;
translating the third one or more files of the third format into a fourth one or more files of the second format;
optimizing the fourth one or more files for improved execution at the second one or more electronic devices; and
causing an inference engine of each of the second one or more electronic devices to utilize the optimized fourth one or more files to generate inferences.
3. The computer-implemented method of claim 1, wherein: the first ML framework comprises one of MXNet, Caffe, TensorFlow, or PyTorch; and the second ML framework comprises one of MXNet, Caffe, TensorFlow, or PyTorch.
4. A computer-implemented method comprising:
receiving, at a provider network, a request to deploy a machine learning (ML) model to a plurality of electronic devices outside of the provider network, the request identifying the ML model or a location of the ML model, the ML model being of a first format associated with a first ML framework, the request further identifying the plurality of electronic devices of a plurality of different hardware configurations;
translating a first one or more files of the ML model in the first format into a second one or more files of a second format;
optimizing the second one or more files based on at least one characteristic of the plurality of electronic devices, the optimizing including performing one or more of layer fusion, quantization, optimal scheduling, or kernel fusion;
causing the optimized second one or more files to be provided to an inference engine of each of the plurality of electronic devices; and
receiving, at the provider network, a second request to deploy a second ML model to the plurality of electronic devices outside of the provider network, the second request identifying the second ML model or a second location of the second ML model, the second ML model being of a third format associated with a second ML framework that is different from the first ML framework.
5. The computer-implemented method of claim 4, further comprising: obtaining, from a storage service of the provider network, the first one or more files from the location.
6. The computer-implemented method of claim 5, further comprising: receiving a request to train the ML model; training, within the provider network, the ML model; and outputting the first one or more files of the ML model to the location provided by the storage service.
7. The computer-implemented method of claim 4, further comprising: receiving, at the provider network from a computing device of a user, a compressed file storing the first one or more files of the ML model.
8. The computer-implemented method of claim 4, wherein the translating and the optimizing are performed within the provider network.
9. The computer-implemented method of claim 4, wherein the translating and the optimizing are performed by the plurality of electronic devices.
10. The computer-implemented method of claim 4, further comprising receiving, at the provider network, a second request to deploy a second ML model to a second plurality of electronic devices located outside of the provider network, the second request identifying the second ML model or a second location of the second ML model, the second ML model being of a third format generated by a second ML framework that is different from the first ML framework.
11. The computer-implemented method of claim 4, wherein: a first of the plurality of electronic devices utilizes a first architecture comprising either an x86 architecture or an ARM architecture; and a second of the plurality of electronic devices utilizes a second architecture that is different than the first architecture.
12. The computer-implemented method of claim 4, wherein the optimizing further includes performing optimizations based on the device context of the plurality of electronic devices.
13. The computer-implemented method of claim 4, wherein the first ML framework comprises one of MXNet, Caffe, TensorFlow, or PyTorch.
14. The computer-implemented method of claim 4, wherein: the request comprises a HyperText Transfer Protocol (HTTP) request message; and the request is received at a web service endpoint of the provider network.
15. A system comprising:
a storage service implemented by a first one or more electronic devices of a provider network; and
an edge device management service implemented by a second one or more electronic devices of the provider network, the edge device management service including instructions that upon execution cause the edge device management service to:
receive a request to deploy a machine learning (ML) model to a plurality of electronic devices outside of the provider network, the request identifying a location of the ML model within the storage service, the ML model being of a first format associated with a first ML framework, the request further identifying the plurality of electronic devices of a plurality of different hardware configurations;
translate a first one or more files of the ML model in the first format into a second one or more files of a second format;
optimize the second one or more files based on at least one characteristic of the plurality of electronic devices, the optimizing including performing one or more of layer fusion, quantization, optimal scheduling, or kernel fusion;
send the optimized second one or more files to the storage service to be stored, wherein the storage service is to, upon receipt of one or more requests from the plurality of electronic devices, transmit the optimized second one or more files to the plurality of electronic devices to be provided to an inference engine of each of the plurality of electronic devices; and
receive a second request to deploy a second ML model to the plurality of electronic devices outside of the provider network, the second request identifying a second location of the second ML model within the storage service, the second ML model being of a third format associated with a second ML framework that is different from the first ML framework.
16. The system of claim 15, further comprising: a ML service implemented by a third one or more electronic devices of the provider network, the ML service including instructions that upon execution cause the ML service to: receive a request to train the ML model; train the ML model; and send the first one or more files of the ML model to the storage service to be stored at the location provided by the storage service.
17. The system of claim 15, wherein the storage service is further to: receive, from a computing device of a user, a compressed file storing the first one or more files of the ML model.
18. The system of claim 15, wherein the second format is a common format configured to be run by the different hardware configurations of the plurality of electronic devices.
19. The system of claim 15, wherein to optimize the second one or more files, the edge device management service is to further perform: optimizations based on the device context of the plurality of electronic devices.
20. The system of claim 15, wherein: the request comprises a HyperText Transfer Protocol (HTTP) request message; and the request is received at a web service endpoint of the provider network.
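To make the deployment workflow recited in the claims above concrete, the following is a minimal, purely illustrative sketch of the kind of HTTP request described in claims 1, 4, and 14: a client asks a web service endpoint of a provider network to translate, optimize, and deploy an ML model to identified edge devices. The endpoint URL, JSON field names, and helper function are hypothetical assumptions made only for illustration and do not describe an actual service API:

    import json
    import urllib.request

    def deploy_model(endpoint, model_location, device_ids):
        # Build the deployment request: the model's location (e.g., within
        # a storage service) and identifiers of the target edge devices.
        # All field names below are hypothetical.
        payload = {
            "modelLocation": model_location,
            "deviceIds": device_ids,
            # Optimization passes named in claims 1 and 4.
            "optimizations": [
                "layer_fusion",
                "quantization",
                "optimal_scheduling",
                "kernel_fusion",
            ],
        }
        # Send the request as an HTTP POST message to the web service
        # endpoint (cf. claim 14).
        request = urllib.request.Request(
            endpoint,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read())

    # Hypothetical usage, assuming an endpoint, a model stored in a
    # storage service, and two registered edge devices:
    # deploy_model("https://edge.example.com/v1/deployModel",
    #              "s3://example-bucket/models/resnet50-mxnet.tar.gz",
    #              ["camera-001", "camera-002"])

Under these assumptions, a second call with a model of a different framework format (e.g., a TensorFlow model rather than an MXNet model) would exercise the second deployment request recited in the claims, with the service performing the format translation before optimization.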